Machine learning reveals distinct gene signature profiles in lesional and nonlesional regions of inflammatory skin diseases

Analysis of gene expression from cutaneous lupus erythematosus, psoriasis, atopic dermatitis, and systemic sclerosis using gene set variation analysis (GSVA) revealed that lesional samples from each condition had unique features, but all four diseases displayed common enrichment in multiple inflammatory signatures. These findings were confirmed by both classification and regression tree analysis and machine learning (ML) models. Nonlesional samples from each disease also differed from normal samples and each other by ML. Notably, the features used in classification of nonlesional disease were more distinct than their lesional counterparts, and GSVA confirmed unique features of nonlesional disease. These data show that lesional and nonlesional skin samples from inflammatory skin diseases have unique profiles of gene expression abnormalities, especially in nonlesional skin, and suggest a model in which disease-specific abnormalities in “prelesional” skin may permit environmental stimuli to trigger inflammatory responses leading to both the unique and shared manifestations of each disease.

The random undersampling strategy involves randomly selecting samples from the majority class, whereas the random oversampling strategy involves randomly duplicating examples from the minority class. SMOTE functions by randomly selecting a sample from the minority class, finding its k nearest neighbors, randomly selecting one of those neighbors, and generating a synthetic sample at a randomly selected point between the two samples in the feature space. As previously noted, we used random undersampling to trim the number of examples in the majority class and then used SMOTE to oversample the minority class to balance the class distribution. The purpose of all class balancing strategies was to obtain balanced representation of both classes for ML. The dataset was split into 70% training and 30% validation, and class balancing strategies were applied to the training dataset only. ML algorithms were then implemented, and evaluation metrics were recorded.
Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves were plotted using the matplotlib (Version 3.3.4) library of Python. A ROC curve is a graphical way to visualize the trade-off between sensitivity and specificity. A high area under the ROC curve represents a low false-positive rate and a high true-positive rate. A PR curve is a useful measure of classification performance when classes are imbalanced. A high area under the PR curve represents both high recall and high precision, where high precision relates to a low false-positive rate and high recall relates to a low false-negative rate.
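The class-balancing procedure described above (random undersampling of the majority class followed by SMOTE on the minority class, applied to the training split only) can be illustrated with a minimal sketch. It uses only the Python standard library and is not the implementation used in this study. The function names, the `target` class size, the default `k=5`, and the toy data are all assumptions made for illustration.

```python
import math
import random


def random_undersample(majority, n_keep, rng):
    """Randomly keep n_keep samples from the majority class (no replacement)."""
    return rng.sample(majority, n_keep)


def smote(minority, n_new, k, rng):
    """Create n_new synthetic minority samples (needs at least 2 real samples)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen sample (Euclidean distance)
        others = [s for s in minority if s is not base]
        neighbours = sorted(others, key=lambda s: math.dist(base, s))[:k]
        neighbour = rng.choice(neighbours)
        # random point on the line segment between the sample and its neighbour
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance_training_set(majority, minority, target, k=5, seed=0):
    """Undersample the majority class to `target`, then SMOTE the minority up to `target`."""
    rng = random.Random(seed)
    majority_bal = random_undersample(majority, min(target, len(majority)), rng)
    n_new = max(0, target - len(minority))
    minority_bal = list(minority) + smote(minority, n_new, k, rng)
    return majority_bal, minority_bal


# Toy training data (hypothetical): 40 majority and 8 minority samples, 2 features each
data_rng = random.Random(1)
majority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)]
minority = [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(8)]
majority_bal, minority_bal = balance_training_set(majority, minority, target=20)
print(len(majority_bal), len(minority_bal))
```

Because each synthetic sample is an interpolation between two real minority samples, it always lies within the range spanned by the observed minority class. The validation split is left untouched so that evaluation metrics reflect the original class distribution.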
For our analysis, we were interested in features that contributed the most towards separation of classes; hence, RF was chosen as the primary ML classifier because it gives impurity-based feature importances.
Feature correlation: Before carrying out binary ML classification, feature selection was necessary in order to remove noninformative or redundant features. We assessed feature redundancy by calculating the Pearson correlation between each feature and every other feature, using the cor function in R. The corrplot library in R was used to generate 22 Pearson correlation plots (figs. S7, S10, S14, S16, S25). In 13 of these correlation plots, there was a pair of highly correlated features (correlation coefficient > 0.8), and the feature with the lower correlation was removed using a greedy elimination approach; this allowed us to retain the most informative features for ML (table S3A). Pearson correlation plots were also generated for keratinocyte gene signatures and T cell signatures (figs. S6, S12). High correlation between the keratinocyte gene signatures made them unsuitable for ML analysis (fig. S6).
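The correlation-based filtering step can be sketched as follows. This is an illustrative reading rather than the procedure actually used: the text does not state what "lower correlation" is measured against, so the sketch assumes that, within each highly correlated pair, the feature with the weaker absolute correlation to the class label is dropped. The function names and the toy signature names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_redundant_features(features, labels, threshold=0.8):
    """features: dict of name -> values per sample; labels: 0/1 class per sample."""
    names = list(features)
    # Collect every feature pair whose correlation exceeds the threshold
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if r > threshold:
                pairs.append((r, a, b))
    pairs.sort(reverse=True)  # greedy: handle the most correlated pairs first

    # ASSUMPTION: "lower correlation" is taken as weaker correlation with the label
    relevance = {n: abs(pearson(features[n], labels)) for n in names}
    dropped = set()
    for _, a, b in pairs:
        if a in dropped or b in dropped:
            continue  # this pair is already resolved by an earlier removal
        dropped.add(a if relevance[a] < relevance[b] else b)
    return [n for n in names if n not in dropped]


# Toy example (hypothetical signatures): sig_A and sig_B are nearly collinear
labels = [0, 0, 0, 1, 1, 1]
features = {
    "sig_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "sig_B": [1.1, 2.3, 2.9, 4.2, 4.8, 6.1],
    "sig_C": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}
print(drop_redundant_features(features, labels))
```

In the toy data, only one of the two collinear signatures survives, while the uncorrelated signature is always retained.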
Statistical Analysis: Statistical differences between cohorts were evaluated using Welch's t-test for (i) lesional disease versus control GSVA scores from a single dataset, (ii) nonlesional versus control GSVA scores from combined datasets, and (iii) mean Z-scores of nonlesional samples versus mean Z-scores of control samples for a single gene signature. A paired t-test was used for lesional versus nonlesional comparisons. The magnitude of each difference (the effect size) was estimated using Hedges' g, calculated as below.
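In its usual form for two independent cohorts, Hedges' g is a standardized mean difference multiplied by a small-sample correction factor J. The expression below assumes the pooled standard deviation and the common approximation to J; the exact variant applied is not specified here, and paired comparisons are typically standardized by the standard deviation of the within-pair differences instead. Here \bar{x}_i, s_i, and n_i denote the mean, standard deviation, and sample size of cohort i.

```latex
g = J \cdot \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}, \qquad
J \approx 1 - \frac{3}{4(n_1 + n_2) - 9}
```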
where cohort 1 and cohort 2 could be either disease samples and their respective control samples from a single dataset, nonlesional samples and control samples from combined datasets, mean Z-scores of nonlesional samples and mean Z-scores of control samples for a single gene signature, or lesional samples and their paired nonlesional samples from a single dataset. All statistical analyses were carried out using the effectsize (version 0.8.1) and stats (version 3.6.2) libraries in R.
The original dataset from GEO, as confirmed by the authors of the paper, contains six batches (H1-H6). We observed that batches H1-H4 were uploaded to GEO in 2017 and batches H5-H6 were uploaded in 2019. For the purpose of our analysis, we divided this dataset into two parts: 2017 (H1-H4) and 2019 (H5-H6). Before splitting, we normalised the original eset using 11 housekeeping (HK) genes as follows: the mean expression of the HK genes was computed for each sample, and every gene expression value in that sample was divided by this mean.
This is the only lupus dataset with skin biopsies available at multiple time points. The clinical trial was carried out in two sequences: patients in Sequence 1 (9 patients) were given Amgen 811, and patients in Sequence 2 (7 patients) were given placebo. Skin biopsies were taken at day 0, day 15 and day 57. For the purpose of our analysis, lesional (L) and nonlesional (NL) biopsies taken at day 0, before any drug treatment, from patients in both sequences were used. The L and NL biopsies of one outlier patient were removed.
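The housekeeping-gene normalisation described above for the batched GEO dataset (dividing every expression value in a sample by that sample's mean HK-gene expression) can be sketched as follows. The data layout, function name, and gene names are hypothetical placeholders; the 11 HK genes actually used are not listed here.

```python
def normalise_by_housekeeping(expression, hk_genes):
    """Divide each sample's expression values by its mean housekeeping-gene expression.

    expression: dict of sample -> dict of gene -> expression value
    hk_genes:   list of housekeeping gene names present in every sample
    """
    normalised = {}
    for sample, genes in expression.items():
        # Mean expression of the housekeeping genes in this sample
        hk_mean = sum(genes[g] for g in hk_genes) / len(hk_genes)
        # Scale every gene in the sample by that mean
        normalised[sample] = {g: v / hk_mean for g, v in genes.items()}
    return normalised


# Toy example with placeholder gene names (not the genes used in the study)
expression = {
    "sample_1": {"hk_a": 10.0, "hk_b": 30.0, "gene_x": 40.0},
    "sample_2": {"hk_a": 4.0, "hk_b": 4.0, "gene_x": 2.0},
}
normalised = normalise_by_housekeeping(expression, ["hk_a", "hk_b"])
print(normalised["sample_1"]["gene_x"], normalised["sample_2"]["gene_x"])
```

After this step, the mean of the HK genes equals 1 in every sample, which puts samples from different batches on a common scale before the dataset is split.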